perm filename CMU[TLK,DBL]1 blob
sn#201938 filedate 1976-02-14 generic text, type C, neo UTF8
.COMMENT !XGPCOMMANDS←"/PMAR=2200";
.DEVICE XGP
.FONT 1 "BDR40"
.FONT 2 "NGR40"
.FONT 4 "BDI40"
.FONT 5 "BDR66"
.FONT 6 "NGR25"
.FONT 7 "NGR20"
.FONT 8 "SUP"
.FONT 9 "SUB"
.TURN ON "↑α↓_π[]{"
.TURN ON "⊗" FOR "%"
.TURN ON "@" FOR "%"
.PAGE FRAME 39 HIGH 83 WIDE
.AREA TEXT LINES 1 TO 38
.COUNT PAGE PRINTING "1"
.TABBREAK
.ODDLEFTBORDER←EVENLEFTBORDER←1000
.MACRO B ⊂ BEGIN VERBATIM GROUP ⊃
.MACRO E ⊂ APART END ⊃
.MACRO FAD ⊂ FILL ADJUST DOUBLE SPACE PREFACE 2 ⊃
.MACRO FAS ⊂ FILL ADJUST SINGLE SPACE PREFACE 1 ⊃
.FAS
.COMPACT
.SELECT 1
.PORTION THESIS
.PAGE←0
.NEXT PAGE
.INDENT 0
.TURN OFF "{∞→}"
.BEGIN CENTER
⊗5↓_Automating the Discovery_↓⊗*
⊗5↓_of Mathematical Concepts_↓⊗*
Text for a seminar at CMU
Monday, Feb. 16, 1976
⊗2Douglas B. Lenat⊗*
Artificial Intelligence Lab
Stanford University
.END
This talk is about the process of discovery and invention in science.
We'll look at a few examples of discovery, use them to propose a
simple model for theory formation, and describe a computer program
embodying that model which has carried out some new mathematical
research on its own.
The first thing I want to stress is the difference between solving a
given problem and thinking it up in the first place. Contrast
playing Monopoly with inventing new games of the same quality. Or
contrast solving the missionaries-and-cannibals problem with the
ill-defined reasoning which led to formulating it.
If I show you the definition of prime numbers, <SLIDE primes> you can
find some for me without any trouble. Whoever first formulated that
concept performed a much harder synthesis.
HOW TO ACCOUNT FOR THE DISCOVERY OF PRIMES
How might we account for a discovery, like prime numbers? Say a
ten-year-old has just heard of these numbers, and he asks you "How
did anybody ever think up that idea?" So you have to quickly
rationalize this discovery; you have to create a plausible scenario
in which someone decides to look at them, and then decides they're
interesting and worth naming.
Here's one such scenario of their discovery:
A researcher already knows about counting and arithmetic, and he's
just been playing with the notion of "divisors-of" a number. One
rule of thumb he sometimes follows when devising new theories is to
consider extreme cases. <SLIDE goto extremes> In more precise terms,
this rule says that if we've just been studying an interesting
function f which takes elements of set A into those of set B, and
there is some extremal member of B, it's worth our time to consider
which elements from A map into that element. The more unusual an item
is, the more time you should spend investigating its inverse image.
Say our sets A and B are the natural numbers. Extremal members of
that set are the smallest numbers: 0, 1, 2, 3, or 4. If the function
is "number-of-divisors-of", the rule of thumb says to construct and
investigate the set of numbers with no divisors, with only 1 divisor,
with 2 divisors <SLIDE n divisors>, with three divisors, and so on.
A priori, any or all of these sets could be interesting. It will
turn out that ⊗4this⊗* set (2 divisors) is the most interesting and
useful one. Why?
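The rule of thumb can be made concrete. Here is a minimal sketch in Python (not the system's INTERLISP; the function names and the cutoff of 50 are my own) of constructing the inverse images of the extremal divisor counts:

```python
# Sketch of the "consider extreme cases" heuristic, applied to the
# function number-of-divisors-of. (Python illustration, not the
# system's INTERLISP; the names are my own.)

def num_divisors(n):
    """Count the positive divisors of n."""
    return sum(1 for d in range(1, n + 1) if n % d == 0)

def inverse_image(f, extreme_value, domain):
    """All elements of the domain whose image under f is extreme_value."""
    return [x for x in domain if f(x) == extreme_value]

domain = range(1, 50)

one_divisor = inverse_image(num_divisors, 1, domain)     # just [1]
two_divisors = inverse_image(num_divisors, 2, domain)    # the future "primes"
three_divisors = inverse_image(num_divisors, 3, domain)  # squares of primes

print(two_divisors)
```

Any of these inverse images might a priori be worth a name; only empirical work singles out the two-divisor set.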
After examining some more empirical data, the researcher may notice
that all numbers seem to factor uniquely into numbers drawn from this
set. He'll conjecture the unique factorization theorem <SLIDE UFT>
and notice that it can be made much less awkward if we give this set
a name, <SLIDE: Primes overlay> say PRIME NUMBERS.
WE DID A PROBLEM REDUCTION
We've just reduced our problem from "how in the world could somebody
invent the notion of prime numbers" to the somewhat simpler task of
"how in the world could somebody invent the notion of divisors-of".
<SLIDE Reduc 1> The reduction was accomplished by citing a heuristic,
which said:
⊗4"Investigate those items having unusual or extreme images under
some operation."⊗*
In this case, our possible items are numbers, the extreme image was
the very small number "2", and the operation was
number-of-divisors-of. The inverse image of 2 turned out to be the
set of numbers we know as the primes.
"Looking for extreme cases"
is a pretty general heuristic.
It's not a rule of inference, just a rule of thumb; it doesn't
guarantee anything, in the sense that modus ponens <SLIDE overlay>
guarantees to preserve validity. It certainly isn't
applicable or useful all the time. And yet it's worth remembering
and using because it frequently leads to interesting, valuable new
concepts to investigate. In the long run, using this heuristic is
cost-effective.
Say the ten-year-old persists in asking why; why did anybody think
of this concept, divisors-of? By now you know what to do. You grab
some rule of thumb, and use it to rationalize that discovery.
In this case, you might realize that a reasonable tactic is to
investigate the inverse of each interesting relation. That accounts
for this discovery, <inverse OVERLAY> reduces it to the discovery of
the concept of multiplication, because looking at the inverse of
multiplication means discovering this concept, divisors-of.
If you can stand it, you might try to continue this reduction even
further. So you want to rationally explain how ⊗4this⊗* concept
might have been discovered.
Suppose we knew how to count the number of elements in a set, and how
to take the cross-product of two sets. Then we might display our
knowledge pictorially thus, < read thru: SLIDE: square> and a quite
natural question is "what is this missing operation?". We can
compute it easily, at least in this direction. <OVERLAY: square> For
example, if we take these two sets, their sizes are 2 and 3. But this
is their cross-product, and ⊗4its⊗* size is 6. So our mystery
operation takes the numbers 2,3 and returns the number 6. Sure
enough, this operation will turn out to be multiplication.
Mathematicians call this activity chasing-around-diagrams, and it's
one of their favorite games.
Here multiplication was discovered as a numeric analog to cross-product.
An alternative, and perhaps more common derivation, is to consider
multiplication as repeated addition.
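The diagram-chasing step can be sketched as follows (a Python illustration of the idea, not the system's code; the particular sets and names are my own):

```python
# "Completing the square diagram": counting and cross-product are known;
# the mystery operation on numbers is whatever makes the diagram commute.
# (Python illustration; the sets and names here are my own.)

from itertools import product

def mystery_op(a_set, b_set):
    """Size of the cross-product of two sets."""
    return len(list(product(a_set, b_set)))

A = {"x", "y"}       # size 2
B = {"p", "q", "r"}  # size 3

# The mystery operation takes the numbers 2 and 3 to the number 6 --
# which is exactly multiplication of the sizes.
assert mystery_op(A, B) == len(A) * len(B) == 6
```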
.SELECT 1
We've performed a sequence of reductions, <chain SLIDE> from primes
to divisors-of to multiplication to counting and cross-product. Each
step was justified by some rule of thumb: Here it is "consider
extremals", here it is "consider the inverse of an interesting
relation", here it is "complete the square diagram".
You could continue this analysis, but at some point the concepts
you'd have to deal with are so elementary that you feel it makes no
sense to talk about "discovering" them. Your strategy then changes,
and instead of trying to think of a heuristic, you try to get the
10-year-old to leave the room.
We can only repeat this analytical process until our original task
has been reduced to the "discovery" of concepts we consider to be
conceptual primitives. Of course this is a personal barrier, and
varies from individual to individual.
When we hear of a new discovery, we mentally try to analyze it in
this fashion ourselves. We try to perceive a chain of concepts,
stretching from that ⊗4latest⊗* discovery backwards, not to our
conceptual primitives, but at least all the way to concepts we
already are familiar with. If we succeed right away, we probably
think that the discovery wasn't so hard, that we could have done it
ourselves.
Let me digress for a moment, and discuss 2 reasons why we sometimes
⊗4fail⊗* to account for a discovery:
.BEGIN FILL INDENT 4,8,0 PREFACE 0
(1) because we're missing a heuristic which is crucial to this line
of development;
(2) because the chain is too long for us to construct easily.
Why might we lack some heuristic? Either because we're not experts in
this field, or, occasionally, because the researcher has used a new
heuristic.
Why might the chain be too long for us to perceive? Usually because
we simply weren't aware of all the current concepts which the
discoverer had at his disposal. So our chain had to be much longer
than his, it had to reach further down before hitting concepts that
⊗4we⊗* knew. Occasionally, because the discovery ⊗4was⊗* in fact a
major advance over pre-existing knowledge.
.END
.SELECT 1
If you're a novice in some field, then you'll suffer from both these
deficiencies, and you may find all the discoveries in that field
amazing. Often we find two friends, like a nuclear physicist and a
molecular geneticist, who can't converse about their research.
Norbert Wiener was always decrying this kind of specialization; he
used to advocate interdisciplinary research. This simple model shows
us why it might be worth the trouble to learn all those alien
concepts. One big advantage you have if you enter a field from
another speciality is that you bring with you a whole bundle of new
heuristics. Many of them will be inapplicable, but if even one or two
can carry over, then you're using techniques that no one else uses,
so you may make some advances right away, as soon as you pick up
enough of this jargon, these concepts to work from. You also bring
with you many new concepts, and that means you have some unique raw
materials to draw analogies from.
A friend of mine was able to breeze through organic chemistry because
he knew about means-ends analysis and found it could be used there.
Phrased in terms of operators, goals, and pre-conditions, each
chemical synthesis problem became a simple means-ends search.
Meanwhile, his classmates were relying on rote memorization.
This is the end of the digression. From now on, we'll assume that the
researcher is working in his own speciality, that he knows most of
these concepts and heuristics.
Very often, a new discovery is really just a couple steps in this
kind of process, one or two heuristic applications away from what is
already known. That is why it's often easy for us to believe that
there wasn't really so much effort needed to make the discovery. If
we already know this concept, and this heuristic, it almost seems
obvious that someone would make this step.
Why then don't we make new discoveries every few minutes? Why are
even some 1-step advances worth publishing? The answer can be found
by seeing how (according to this model) new discoveries are made.
We're talking now about reversing the flow of time here. Instead of
analyzing a discovery by breaking it into simpler and simpler ones,
we're now progressing in this direction, trying to grow new nodes on
this tree, using whatever heuristics are available.
The researcher has a bunch of concepts that he knows about, say a few
thousand. I won't even consider his ⊗4legal⊗* choices, since they
are quite literally infinite in number. Say he knows a few hundred
heuristics he can apply. In any situation, then, he may literally
have a million alternatives to consider, <OVERLAY of red lines>.
These are not his ⊗4legal⊗* choices, but his ⊗4plausible⊗* ones; each
and every one is justified by some heuristic, like this link is.
But nature is not kind to us, and only a very small percentage of
those new concepts will turn out to be interesting. What is even
worse, you can't whiz through this space, stopping at the interesting
concepts, because it may require months of effort to decide whether
the concept you're studying is worthwhile or a dead-end.
Even here, where we have just a few concepts and heuristics, the tree
gets bushy when we try to go in this direction. We could apply
"looking at the inverse" here (cross-product) to decide to investigate
the kind of operation we normally call projection. Or here
(divisors-of) even if we decide to look for extremals, which way do
we go? Why not examine numbers which have very many, rather than
very few, divisors?
The analysis procedure was a search for a path from a given node back
to primitives, but the synthesis procedure is blind; it doesn't have
a well-specified goal, so it is combinatorially more explosive.
So it's harder to make a valuable new discovery than to rationalize
how someone might plausibly have discovered something you've already
seen. This explains why discoveries sometimes seem so obvious
⊗4after the fact⊗*, even to their inventor who may have labored years
on the problem.
The same let-down occurs when a magician shows you how he manages
some magic trick. For example, at talks like this one, where the
speaker states some amazing behavior his program exhibits, but when
he explains how it works, it loses its impact.
.SELECT 1
INTRODUCE HEURISTIC GUIDANCE
The standard Artificial Intelligence method for muffling an
explosion of red arrows is to introduce purple and green heuristic
guidance <2nd OVERLAY: green/purple>: purple strategies highlight
the most promising nodes to work on, and wiggly green ideas
suggest one or two operators to try first on a given node.
Recall
that each red operator is a rule of thumb. So the green strategies tell
us, in a given situation, which heuristics are best to apply.
<Remove Red-lines overlay>
Notice the references to time or situations that have crept in.
Instead of just a bare rule of thumb like this one
<SLIDE: heurs.; point to line 1>
the system would have a more sophisticated piece of knowledge like
<point to line 2>
The left-hand sides of these situation-dependent rules
will typically trigger on
some experimental data that was just collected. Here, it was running
a predicate on random arguments and seeing what percentage of the
time it returned the value "True."
The basic idea so far is that discovery can be successfully
represented as a heuristic search procedure: <SLIDE: 3 tasks> A
continual expansion of a tree of concepts and relationships, guided
by some judgmental criteria, some informal rules of thumb.
Your memory of this talk will fade with time, but even if it drops
down to just four words, I'd be happy if they were ⊗4these⊗* four words:
Discovery as Heuristic Search.
ONE CAN WRITE A THY. FORMATION PROGRAM
If this is true, one should be able to write a computer program which
proposes new concepts in some scientific domain. After picking the
field in which the system will work, there are 3 standard things
you'd have to do, the same things you always have to do to implement
a heuristic search:
.BEGIN INDENT 0,3,0 PREFACE 0
(1) specify the initial core of knowledge that it will start with,
(2) specify the legal ways of adding to that knowledge
(3) collect the strategies and tactics that experts in that field use
when trying to form theories.
.END
MATHEMATICAL THY. FORMATION
If anyone asks at the end, I'll discuss why mathematics is an ideal
field in which to try out these ideas. For now, assume that we've
settled on math. Before carrying out these 3 steps, I should at least
mention what I mean by a ⊗4mathematical concept⊗* and a ⊗4theory⊗*.
A mathematical theory <SLIDE: thy> is built on a ↓_⊗2FOUNDATION⊗*_↓
of primitive, undefined ↓_objects_↓, and ↓_operators_↓, plus some
postulates and axioms which interrelate them. Each of these
qualifies to be called a mathematical ↓_concept_↓.
Onto this foundation is built a tower of theorems -- statements which
are implied by the foundation -- plus some
definitions of new concepts in terms of preexisting ones.
This is the final, breath-taking architecture that we see in math
textbooks. <math textbook-order slide> All undefined objects and
axioms are stated, and then an infinite define-state-prove loop is entered.
Occasionally, it's broken by some post-facto motivation or real-world
application. This entire process never involves backtracking.
What they ⊗4don't⊗* show you in textbooks is the unaesthetic scaffolding that was
required to erect that tower: that scaffolding is empirical induction
and good old-fashioned search, with back-tracking and everything.
Yes, Virginia, mathematics is an empirical science. Most of you know that
already, but you learned it from hard experience, not from any
algebra textbook.
Math research has
very little to do with the smooth flowing proofs you find there. It's
more like experimental physics, where you gather some data, look for
regularities, and induce some conjectures. If they can't be phrased
concisely with the existing terminology, that's a clue that some new
definitions should be made. Just like we did near the beginning,
introducing the term "prime numbers" to simplify the statement of the unique
factorization theorem.
<SLIDE: research-order> So if anything, math research
proceeds in almost the opposite order from the way it is finally
reported in books and journals.
This is done by empirical induction (1st line), by studying
experimental data and trying to perceive some regularities in it.
Most of the mathematician's empirical forays are fruitless, but there
is no way to know this before carrying them out. His only guidance
is his intuition about what to investigate next, his purple and green
heuristic strategies.
What I'm going to describe now is a computer program which uses the
paradigm of heuristic search to develop simple mathematical theories
in this same way.
I'm talking about a real system,
written in INTERLISP. It lives at the PDP-10
at SUMEX, and occupies 100k. All the discoveries I have mentioned and
will mention were made by the system working by itself.
<3tasks, revisited SLIDE> Recall that there is a standard set of
tasks to perform, to create such a system.
.SELECT 1
The first step was to list all the concepts the system would know
about initially. That is, a bunch of primitive notions which are the
starting state of the system's knowledge. But how do you decide
which concepts to include?
Three kinds of criteria came to mind:
First, a ⊗4↓_complete_↓⊗* set of concepts, from which all knowledge
in the universe could be derived. But this wouldn't fit into 256k.
More seriously, I tried to get a ⊗4minimal⊗* set of concepts, only
starting with concepts which are indispensable, fundamental in some
absolute sense. But as we saw earlier, the conceptual primitives of
you or me or the system are purely personal. Until recently, no one
realized that arithmetic could be derived from set theory.
Scrapping both complete and minimal criteria, the next one I thought
about was to isolate just those concepts which young children seem to
have, which Piaget might have called ⊗4pre-numerical⊗*. This turned
out to be a quite nice set, <SLIDE concept> and these are in fact the
concepts the system actually started with.
Included here are static structural concepts like sets and relations,
and active ones like substitute, union, reverse, compose 2 relations,
and so on. In all, about 100 concepts were called for. Note that
there is no notion of numbers or proof.
At a slightly deeper level, we have to decide precisely what
information the system will know about each of these concepts.
Certainly we should provide a ↓_definition_↓ or some other way to
distinguish among the concepts.
For each operation, we should also mention its ↓_domain and range_↓.
Perhaps we know of some ↓_abstract representation_↓ in which a
concept can be quickly manipulated analogically (e.g., Venn diagrams
for the concept of Sets).
Also, to help the system decide what to do, it would be nice to
supply some kind of rough judgment of the worth of each concept: its
interest, its usefulness, its simplicity, symmetry, etc. Now we're
veering from science into metaphysics, so let me just say that we
attach a number to each concept, which is supposed to represent its
worth. Empirically, the precise numbers turn out not to affect the
system's behavior much.
There are many other facets to each concept, <SLIDE facets>,
including Examples, Generalizations, Specializations, Analogies, and
so on.
This set of facets is fixed once and for all. Each concept can
possess any or all of these facets. But that's all we can ever know
about a concept. A concept ⊗4is⊗* simply a bunch of facets, drawn
from this fixed set. This idea is what I called the BEINGS
representation of knowledge a few years ago, and is subsumed by what
is now known as FRAMES.
In the LISP program implementing these ideas, each concept is stored
as an atom, and its facets are implemented as the properties on its
property list.
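In modern terms, that representation is a record keyed by a fixed set of facet names. A hypothetical Python analog (the facet list follows the slide; everything else here is my own illustration):

```python
# A concept is a bundle of facets drawn from one fixed set -- in the
# original INTERLISP, an atom whose facets are properties on its
# property list. A dictionary is the obvious modern analog.
# (Hypothetical sketch; the example values are my own.)

FACETS = {"definition", "algorithms", "domain-range", "examples",
          "generalizations", "specializations", "analogies",
          "conjectures", "worth", "heuristics"}

def make_concept(**facets):
    """Build a concept; reject anything outside the fixed facet set."""
    unknown = set(facets) - FACETS
    if unknown:
        raise ValueError("not a legal facet: %s" % unknown)
    return facets

sets_concept = make_concept(
    definition=lambda x: isinstance(x, frozenset),
    examples=[frozenset(), frozenset({1, 2})],
    worth=400,
)
```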
Let's look at one particular concept, and see what kinds of
information could go into each facet. We'll consider the concept
COMPOSE, which refers to the composition of two relations. We define
AoB(x) as A(B(x)). That is, apply B and then apply A to the result.
<COMPOSE slide> <go over each facet>. Here is an abbreviated
description of that concept.
Several equivalent definitions are provided. Some are computationally
efficient, and others are slow but especially easy for the system
to analyze. Each definition is an executable LISP predicate, which
takes 3 arguments and tries to see whether the composition of the
first two is equal to the third.
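Such a definition predicate might look like the following sketch (Python rather than LISP; the sample range and example operations are my own, and a real definition would compare the system's concepts, not Python functions):

```python
# Sketch of a definition facet for COMPOSE: an executable predicate that
# takes three operations and tests whether the composition of the first
# two equals the third. (Illustrative Python, not the system's LISP.)

def compose(a, b):
    """AoB(x) = A(B(x)): apply B, then apply A to the result."""
    return lambda x: a(b(x))

def is_composition_of(a, b, c, sample=range(-5, 6)):
    """A slow but transparent definition: test c against AoB pointwise."""
    ab = compose(a, b)
    return all(ab(x) == c(x) for x in sample)

double = lambda x: 2 * x
inc = lambda x: x + 1

assert is_composition_of(inc, double, lambda x: 2 * x + 1)      # inc o double
assert not is_composition_of(double, inc, lambda x: 2 * x + 1)  # wrong order
```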
Several algorithms are provided, again exhibiting this trade-off
between speed and transparency.
The specializations facet points to another concept which is similar
but only composes a relation with itself.
An ⊗4example⊗* of Composition is provided here.
One conjecture stored here is that composition is associative. This
was discovered by the system, not put in by me. Since associativity
isn't known to the system, the entire statement of that bulky
relationship must be stored.
Down here we have some heuristics for dealing with compositions. An
explanation of why it's good for anything, how you can spot an
interesting composition (e.g., if the domain and range of the composition
are equal), hints for filling in certain facets of any new composition
which gets created, and something worth checking about each
composition.
Notice that the format of each kind of facet varies from productions
<heurs> to little programs <algs>, to bits of declarative data <d-r>.
Let me put back the list of concepts again, and also our list of
facets.<SLIDE>
What does the system actually do, now? From a heuristic search
viewpoint, the primary activity of the system is to create new
concepts. When it does so, however, most of their facets will be
blank, so that in practice most of the system's time is spent in
fleshing out some of the concepts which it has recently discovered,
filling in some of their facets.
So there are two ways that the system expands its knowledge. One way
is to find out more about the already-known concepts; the other way
is to define new ones. <SLIDE 2ways to grow>
In the LISP
implementation, that translates to filling out the properties of existing
data structures (here some examples of Compositions have been found),
and sometimes creating entirely new data structures
(Here the system takes one
particular example of Compose and promotes it
from being just one entry on one facet of one concept, into being a full-fledged
concept with facets of its own).
What happens when a new concept is created? Well, at the moment of its
birth, the system knows
its definition, and something about its worth. Also how it
connects to some existing concepts.
So a few seconds are spent filling in all the facets whose contents are obvious.
Once it's created, most of its
facets will still be empty, so there will be many new things for the system
to do.
Since the
"fleshing out" activity is quite expensive, it would be a mistake to
just try to fill out all the facets of all the concepts.
For example,
the system starts off with 100 concepts. Of all the possible facets
they might possess, only a few are filled-in for each one; so there are
2000 blank facets initially.
It would take days of cpu time for them all to get filled in. And
each time a new concept was proposed, an hour might be shot working
just to flesh it out completely. The system must manage its time more carefully.
There must be some "fine structure" to the process of filling out
the tree of concepts. This is a non-standard problem, not
faced during your typical heuristic search. We can't permit the system to
completely fill out each new node it wants to look at.
The solution I use is to dynamically optimize the expected value of what the
system is doing. By that I mean that it must look at a bunch of very
plausible tasks, and choose the most promising of them to execute.
Each task is at the level of filling out a particular blank facet of a particular
concept.
The LISP implementation of this idea is to have a global job-list,
<SLIDE joblist> an agenda of suggested activities to do, an ordered
list of candidates for the system's attention. Each entry consists
of a specific task for the system to work on, a priority value, and a
list of reasons why this is a reasonable task to do.
Except for this list of symbolic reasons, this is the same as the agenda
mechanism proposed by Winograd and Bobrow for their new language, KRL.
At some moment, there might be many entries on the agenda, one
indicating that the system could Generalize the relation
Object-equality somehow, one suggesting that the system fill-in
some examples of the newly
discovered concept Primes, and so on.
This agenda governs the top-level control structure of the program.
The system picks the highest priority job off this list, and executes it.
In the process of performing a job, 3 kinds of things can happen:
.BEGIN PREFACE 0 INDENT 3
(1) blank facets may get filled in,
(2) new concepts created, and
(3) new jobs added to the agenda.
.END
In practice, so few jobs ever get executed that we only keep around the
top 100 or so at any moment.
Each new job gets entered by a heuristic; for example, this one is
suggested by a heuristic which recognized that there were too few
random pairs of objects that turned out to be equal. At the moment it
is entered, whoever
suggested the job will know some pretty definite reasons for why it
might be worth doing. So he attaches a list of reasons to it. That
proposer, that heuristic, also assigns to each reason a numeric
rating between 0 and
1000.
One global formula then looks at the job and the reasons, and
assigns it a single priority rating (also from 0 to 1000).
<SLIDE: Formula> That formula was just guessed at as
a first pass, and it's worked well enough not to tamper with it. It
involves square roots and dot-products of vectors and isn't really
worth delving into.
Each concept, each facet, each job has a number attached to
it. This formula uses them to attach an interestingness factor to
any given job on the agenda.
Each reason is actually just a token, but as you can imagine, reasons
are quite useful for keeping the human user aware of what the system
is doing and why.
A surprisingly important mechanism comes into play when a job is
proposed which turns out to be ⊗4already⊗* on the agenda. <OVERLAY:
new job> If the job is being proposed for pretty much the same
reasons as before, then its priority rating is incremented only
slightly. But if it is being proposed for a new reason, then its
priority is increased tremendously.
This is the practical importance of maintaining reasons for each job.
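A minimal sketch of that bookkeeping (Python; the tasks, ratings, and the exact merging arithmetic are my own invention, and the real priority formula is not reproduced):

```python
# Sketch of the agenda: each job carries symbolic reasons with numeric
# ratings; re-proposing a job for an OLD reason bumps its priority only
# slightly, while a genuinely NEW reason boosts it sharply.
# (Illustrative Python, not the system's actual mechanism.)

def propose(agenda, task, reason, rating):
    for job in agenda:
        if job["task"] == task:
            if reason in job["reasons"]:
                job["priority"] += rating // 10   # same old reason: slight bump
            else:
                job["reasons"][reason] = rating
                job["priority"] += rating         # new reason: big boost
            break
    else:
        agenda.append({"task": task,
                       "reasons": {reason: rating},
                       "priority": rating})
    agenda.sort(key=lambda j: j["priority"], reverse=True)

agenda = []
propose(agenda, ("generalize", "Object-equality"), "too few equal pairs", 200)
propose(agenda, ("fill-examples", "Primes"), "newly created concept", 300)
propose(agenda, ("fill-examples", "Primes"), "appears in a conjecture", 400)

# The Primes job, proposed for two distinct reasons, now tops the agenda.
```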
Notice how nondeterministic the flow of control is. The system looks
over the job-list, picks the job with the highest priority, and tries
to execute it. That priority rating then dictates how to manage the
system's resources: for example, the number of cpu seconds to spend
before giving up and trying the next entry on the agenda, the number
of list cells a facet can occupy before quitting, and so on.
(This idea was inspired by Carl Hewitt's notion of activation energy.)
.COMMENT OMIT IF SHORT ON TIME;
We are assuming that there is a self-consistent set of computational
rules for manipulating the estimated worth of all the concepts and
jobs.
In my own modest way, I call this scheme a zeroth-order calculus of
interestingness factors. Predicate calculus tries to preserve and
manipulate validity. Carnap and Hintikka's Lambda-calculus lets you
manipulate certainty factors. This first pass at a calculus of
priority values tries to preserve and manipulate interestingness
factors or estimated worth. Much work remains to be done on it before
it deserves more than just a mention in passing.
Let me warn you about using this scheme, however. The whole idea of
a global priority value is quite dangerous.
Comparing the worths of "compose" and "sets" is
really
comparing apples to oranges.
I think that this scheme works adequately here only because
neither the heuristics nor the jobs are strongly
coupled; they are all pretty much independent of each other.
For example, it
doesn't really matter which order the top 2 jobs get executed in;
no one cares whether we look for primes right before generalizing equality,
or right afterwards.
What ⊗4does⊗*
matter is that they are near the top, not near the bottom, of the agenda.
That's why a crude kind of evaluation function is all you need.
.SELECT 1
Just as with the starting concepts, a question arises as to how to
collect the starting ⊗4heuristics⊗* that will guide the system.
The standard way to do this, in any domain-specific knowledge-based
system, is to have some experts from that field introspect on all the
clues they use, all the knowledge that guides their research efforts.
Unfortunately, most scientists' heuristics are demon-like.
That is, they only occur to our conscious minds in situations where
they might plausibly be used. We grossly underestimate the number of
such tactics that we use every day. We just never access them out of
context.
What I did was to consider each facet of each concept in turn, and
try to think of all the heuristics that I might use in that context.
In that way, several hundred heuristics were found.
They fell into a few major types: Some of them suggest new concepts
to consider <SLIDE: heurs> (like "look at the inverse") but these
rarely lead to cost-effective results. Much more valuable are the
ones which give hints for ⊗4when⊗* to do things ("if you can't find
any examples of when a predicate is satisfied, then try to generalize
it").
Also valuable are those which tell ⊗4how⊗* to do something ("to fill
in examples of A, find an operator whose range is A,
and apply it").
Finally, there are heuristics for filling in new heuristics (like
"after using the `extremals' heuristic to derive a new subset b↑-↑1
of A, add the heuristic that `b↑-↑1 is a good set to use when
working with or trying to derive conjectures involving the
operations f and f↑-↑1.' "). That's not clear, so let me give
you an example of how we'd use this.
The system discovered primes, using the look-for-extremals heuristic.
⊗4This⊗* rule then said to create a brand new
heuristic which read:
.ONCE INDENT 0 PREFACE 0
⊗2"Primes are useful to generate and to test conjectures involving
multiplication and divisors"⊗*
Because f and f↑-↑1 are divisors and multiplication, b↑-↑1 is the set
of prime numbers, and A is the set of numbers.
There are very few of
these that I know of, but each one can have a dramatic impact on the
system's behavior.
By and large, the valuable heuristics all have the form "in situation
S, consider doing X". In my system, therefore, each heuristic is
written as a production rule.
Here's how this heuristic (if few exs, genlize)
would have to be stated <SLIDE: detailed heur:genl>
Note that despite even this amount of detail we're still talking one level above LISP.
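As a rough illustration of the production-rule form (a Python sketch, not the actual rule; the 5% threshold and the field names are my assumptions):

```python
# A heuristic as a production rule: a left-hand-side test on the current
# situation, and a right-hand side that proposes a job. (Sketch of the
# "if few examples, generalize" rule; thresholds and names are mine.)

def few_examples_rule(concept, agenda):
    """IF a predicate almost never returns True on random arguments,
       THEN suggest generalizing its definition."""
    hits = concept.get("empirical-hits", 0)
    trials = concept.get("trials", 0)
    if trials > 0 and hits / trials < 0.05:
        agenda.append({"task": ("generalize", concept["name"]),
                       "reasons": ["predicate almost never satisfied"]})

agenda = []
equality = {"name": "Object-equality", "empirical-hits": 1, "trials": 100}
few_examples_rule(equality, agenda)   # fires: 1% is below the 5% threshold
```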
.SELECT 1
With hundreds of heuristics, the system can't afford to run through them all
in each situation.
Note that this is ⊗4not⊗* a standard problem faced by most heuristic
searches. Like the human mathematician, the system should only
access a strategy in those situations where it could plausibly be
used. Each heuristic was generated while contemplating one
particular concept, so we keep it tacked on to that particular concept.
For example, some of the heuristics tell whether a given composition
is interesting or not. There's no reason to access them unless we're
trying to judge a composition. So we keep them attached to the
COMPOSE concept, as just another facet.
<SLIDE: genl branch> Say we want to see how interesting this
particular operation is. Then we could use evaluation heuristics
gathered from any of these concepts.
Any reason why an operation might be interesting will also be valid here.
The system would ripple upward,
in the direction of increasing generality, collecting heuristics as
it went. They would be evaluated, and the average result would be
taken as the worth of this concept.
This is why the relevant heuristics were listed before as a facet of
each concept, like domain/range, examples, and definition. That's
also a nice symmetric way to think of things. Formally, it's
equivalent to having another test on the left-hand side of each
heuristic rule, each production, to select for the concept the
heuristic is associated with. But computationally, it is much more
efficient to keep them separated.
If we're dealing with a bunch of concepts, we ripple upward from all
of them. All the heuristics that are gathered up get dumped together
into one little production system, which is then run to produce a
value.
This works in practice because the heuristics rarely couple, in the sense that
Newell defines coupling; the order in which they are executed is not
too important.
They don't interact much. So it's trivial to assemble them and run
them together.
Simple schemes like this only work when the components of the system
are very independent, when the problem was easy to begin with.
Tight interdependence would require a more sophisticated approach.
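The rippling-and-averaging scheme just described can be sketched as follows. The toy concept hierarchy, the scores, and the function names are all hypothetical, and Python again stands in for the system's LISP.

```python
# Hypothetical sketch of "rippling upward": gather evaluation heuristics
# from a concept and all of its generalizations, run them together as
# one little production system, and average the results.

concepts = {
    "my-composition": {"genl": "operation", "heuristics": []},
    "operation":      {"genl": "activity",
                       "heuristics": [lambda c: 400]},  # e.g. "interesting range"
    "activity":       {"genl": "anything",
                       "heuristics": [lambda c: 200]},  # e.g. "cheap to run"
    "anything":       {"genl": None, "heuristics": []},
}

def worth(name):
    scores, node = [], name
    while node is not None:                  # ripple toward greater generality
        for h in concepts[node]["heuristics"]:
            scores.append(h(name))           # order barely matters, since the
        node = concepts[node]["genl"]        # heuristics rarely couple
    return sum(scores) // len(scores) if scores else 0
```

Because the heuristics are so independent, simply running them all and averaging is adequate; a tightly coupled rule set would need real conflict resolution instead.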
.SELECT 1
By the way, using one of the system's own heuristics, if something is
interesting then it's worth naming. I think the program has become
just interesting enough now to warrant a name; let's call it AM.
Let's finally take a look at the program in action.
.SELECT 1
CARDINALITY EXAMPLE: OPTIONAL
This is an actual excerpt from a session with AM, as it first discovers
Cardinality, the concept of numbers. AM already knew about sets,
bags, lists, etc., and equality of 2 sets. A bag is a multiset, an
unordered structure with repeated elements allowed.
<Cardinality slide>
I'll read through it once to show you that AM really performs magic,
then we'll peek behind the curtains to see that AM is really no more
than the straightforward data-structure expander that I've been
describing.
(1) After testing random structures to see if they are equal, AM
concludes that it would be worthwhile to generalize the predicate
EQUAL. The user asks "why" and AM mentions that few randomly selected
objects turned out to be equal to each other.
(2) Two generalizations of EQUAL are constructed. The first turns
out to be the relationship "has the same length as", and the second is "has the
same first element as".
(3) After testing random structures to see if they have the
SAME-LENGTH, AM decides to try to find a canonical form for all
structures. This would mean that two structures were of the same
length if and only if their canonical forms were identically equal.
(4) AM comes up with the idea of making this conversion by (i)
converting each structure to a BAG, and (ii) replacing each element
by the constant T. Thus the canonical form of all structures of
length 10 is a BAG of 10 T's. AM decides to restrict various
operations like Union and Intersection to such canonical structures,
and see what happens.
The user tells AM that such canonical structures should be called
NUMBERS.
Well, now that you've seen what AM says, let's see how it does it.
********************************************************************
To start off slow, let's examine the ways that AM found examples of
things which were EQUAL. The highest-priority job was selected. It
said to fill in the EXAMPLES facet of the EQUAL concept. To do that,
AM gathered heuristics by rippling outward.
One of these rules of thumb said to instantiate the definition of
the predicate EQUAL. Here is the recursive definition of EQUAL: <defn
of EQUAL slide>. An easy way to instantiate it would be simply to
satisfy the base step. To do that, the two arguments must be
identical atoms. So some examples like NIL=NIL are produced.
Or we could invert the recursive step, by plugging in objects
already known to be equal. We already know that NIL=NIL; so if the
CARs of 2 structures are both NIL, and their CDRs are both NIL, then
the structures will be Equal. So an example like
(List NIL) = (List NIL) is generated.
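For reference, here is a Python transcription of that recursive definition of EQUAL, with Python lists standing in for LISP structures and [] for NIL; the atom test is my reading of the slide's definition, not AM's actual code.

```python
# Recursive definition of EQUAL, transcribed into Python: two objects
# are equal iff they are identical atoms, or both are lists whose CARs
# are equal and whose CDRs are equal. [] plays the role of NIL.

def atom(x):
    return not isinstance(x, list) or x == []   # NIL counts as an atom

def equal(x, y):
    if atom(x) or atom(y):
        return x == y                 # base step: identical atoms
    # recursive step: CARs equal AND CDRs equal
    return equal(x[0], y[0]) and equal(x[1:], y[1:])
```

Satisfying the base step yields examples like NIL=NIL; inverting the recursive step from known examples yields ones like (List NIL) = (List NIL).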
An entirely different approach is suggested by a heuristic tacked onto the
Activity concept. It says that, when filling in examples of any
activity or predicate, we can randomly instantiate the arguments, run
the concept, and observe the results. So AM randomly picks pairs of
objects and sees if they satisfy EQUAL. This is where the empirical
data comes from that tells AM how rarely EQUAL is satisfied. We
already saw the heuristic that said to consider generalizing the
definition of a predicate, if very few examples can be found. So a
new job is added, which says to generalize the definition of EQUAL.
Since it has a good reason, it soon rises to the top of the job-list.
********************************************************************
AM is now trying to generalize the definition of EQUAL. AM's first
magic trick is turning equality into equal-length. Look at this definition.
Two objects are equal if they are identical atoms, or
if they are both lists and their cars are equal and their cdrs are
equal. This is the recursive definition that AM wants to generalize.
Notice that it is composed of a base step <point> and two recursive
calls which must both succeed. In such a case, an easy way to
generalize is to assume that one of them will always succeed. That
is, force the predicate to recur successfully in only one direction,
either the CARs or the CDRs, but not both. In particular, by doing
away with the test in the CAR direction <defn OVERLAY slide>, we get
a predicate which counts down two lists, and returns T if and only if
they both become empty at the same time. That is, iff they have the
same length.
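The generalization step itself is easy to sketch: delete the CAR-direction test and recurse only on CDRs. This Python version (lists again standing in for LISP structures) is a sketch of the idea, not AM's code.

```python
# Generalizing EQUAL by forcing the CAR-direction recursion to always
# succeed: drop its test entirely. What survives just counts down the
# two lists and returns True iff they become empty at the same time --
# that is, iff they have the same length.

def same_length(x, y):
    if x == [] or y == []:
        return x == y                 # succeed iff both emptied out together
    return same_length(x[1:], y[1:])  # recurse only in the CDR direction
```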
The second magic trick was deriving the canonical form of a
structure. AM did this by selective experimentation, which I'll describe
later if anyone asks.
This whole development was done to satisfy just 3 jobs from the agenda
list:
.BEGIN FILL INDENT 4,0,0 PREFACE 0
(Fillin Exs. of Equality)
(Generalize Defn. of Equality)
(Canonize Same-length wrt Equality)
.END
.SELECT 1
GOOD RUN
Everything that I've described has been coded and debugged, and all
the examples run, but I always seem to have 20 pages of notes about
additions I have to make to AM, and better ways to do things and so
on.
Let me describe the best run so far, which occurred last month, in
fact. The system started with the 100 prenumerical concepts I showed you
earlier, and after 1
cpu hour, it had derived about 100 new ones. Half of those were real
losers, but the rest were interesting. Some deep chains of
discoveries were in fact made. <SLIDE chain> Note it went through
the whole chain of discoveries I talked about during this talk, all
the way from set-theoretic notions to numbers to unique
factorization.
AM did this all on its own; if I sit at the keyboard and guide it along,
the same discoveries can be made in about a third the cpu time.
200 jobs from the agenda were executed during this period.
Each job was allocated from 1 to 100 cpu
seconds, depending on its priority value. Most jobs were granted
about 30 seconds, but used only 20 or 25.
The 1-hour run ended because of space problems. I'm now going to save the
best of the derived concepts, throw away the rest, and re-start AM from that
point. I haven't tried that yet.
How general is AM?
I think that the basic design could be used to form theories in many
empirical sciences; the only problem is obtaining all the data that
AM would ask for each time it proposed an experiment, for example.
That would be trivial in most formalizable fields, like physics or
other domains of mathematics. It would also be easy for Programming,
where doing an experiment would mean doing an EVAL. It would be ridiculous
in softer fields like sociology, where doing an experiment entails
months or years of effort.
In any domain, including the simple one I chose, the major problem is going
to be providing heuristics which are powerful enough
to recognize quickly whether each new concept is going to be a
dead end.
The most exciting aspect of the project is the ability to perform
experiments on the system: modify the starting concepts, the
heuristics, and observe the results. This would be using AM as a
tool, to explore the impact of various pieces of knowledge. We're
beginning such experiments with AM this month.
MAXIMALLY-DIVISIBLE NUMBERS
After I spent a year convincing my reading
committee not to expect any new mathematical knowledge to emerge, one
brand-new mathematical theory was accidentally discovered.
I'll close this talk by indulging
my maternal pride, and
show you one slide about it. <OVERLAY: Maxdivis>
Just as AM decided to look at numbers with extremely ⊗4few⊗* divisors,
it thought to look at those with abnormally ⊗4many⊗* divisors.
<SLIDE AM conjec> For any number d, the
smallest number with at least d divisors is of this form, where these
exponents decrease inversely as the logs of the primes. For example,
here's the smallest number with 4 million divisors, and in fact the
exponents of its prime factors satisfy this constraint. Although it
required a couple of humans to precisely state and prove this
constraint, it was AM which suggested what to look for.
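As a sanity check on what "maximally divisible" means, here is a brute-force Python illustration (practical only for tiny d, nothing like the 4-million-divisor case) of the smallest number with at least d divisors; recall that n = p1↑e↓1...pk↑e↓k has (e1+1)...(ek+1) divisors.

```python
# Brute-force illustration of the objects AM was examining: the
# smallest number having at least d divisors. Only feasible for tiny d.

def divisor_count(n):
    count, i = 0, 1
    while i * i <= n:
        if n % i == 0:
            count += 1 if i == n // i else 2   # count both i and n // i
        i += 1
    return count

def smallest_with_divisors(d):
    n = 1
    while divisor_count(n) < d:
        n += 1
    return n
```

For instance, the smallest number with at least 8 divisors is 24 = 2↑3 * 3, and its exponents already decrease with the size of the primes, as the conjecture requires.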
In fact, I happened to mention to a friend of mine that the system
was wasting a lot of time in this worthless area, but that it always
had a good reason for what it was doing. He suggested that maybe this
concept ⊗4was⊗* interesting, and, as in the old Hilbert joke, after a
day or two of thought I concluded it ⊗4was⊗* intuitively interesting.
Ideally, one could run AM overnight, and
wake up the next morning to find five or ten more or less interesting
new concepts to look at. It would still be the researcher who made
the highest-level selections of what to do next, much like a
professor directing graduate students.
This is the ultimate use that I hope such systems will be
put to -- probably in the next century:
to work alongside a researcher, to free him from doing all
the routine detective work.
.SKIP TO COLUMN 1
**********************************************************************
SUPPLEMENTARY: Finding canonizing function for numbers
Experiment number 1 was to convert the arguments from one type of
structure to another, and see if that changed the value of
SAME-LENGTH when it was applied to them. It turned out not to, which
meant that one single canonical type of structure could be chosen.
But there are 4 kinds it could be: a set, a list, a bag, or an
ordered set. The particular kind of structure depends on 2 things:
whether altering the order of the elements in the arguments makes any
difference to SAME-LENGTH, and whether adding multiple occurrences of
the same element makes any difference to SAME-LENGTH.
But changing the order makes no difference, so the canonical
structure type must be unordered, either BAG or SET. But adding
multiple elements ⊗4does⊗* make a difference, so the type must be BAG
or LIST. The only possible canonical type is therefore BAG.
Next, AM notices a message in the Ties facet of SAME-LENGTH, that
explicitly says that it's the same as Equality, except that
SAME-LENGTH does not recurse in the CAR direction. The CAR of an atom
is its value cell, so that message is enough evidence to warrant
testing whether or not the value cells of any of these elements are
ever accessed. Empirically, they never are. So AM assumes it can
replace them all by a single, fixed value: say "T".
Putting this all together, we see the canonizing transformation as:
convert the structure to a BAG, and substitute T for each element.
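That transformation is easy to sketch in Python, with a multiset (collections.Counter) standing in for a BAG; the representation is my choice, not AM's.

```python
# Sketch of the canonizing transformation AM synthesized: convert any
# structure to a BAG (order discarded, multiplicity kept), and replace
# every element by the fixed constant T. The canonical form of any
# length-n structure is therefore a bag of n T's.

from collections import Counter

T = "T"

def canonize(structure):
    return Counter(T for _ in structure)
```

Two structures then satisfy SAME-LENGTH if and only if their canonical forms are identically equal, e.g. canonize([1, 2, 3]) equals canonize(("a", "b", "c")).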
**********************************************************************
Another thought I hope you'll leave with today is that (for this task
at least) if the heuristic operators are sophisticated enough, then a
simple numeric calculus is all the meta-level heuristics you need for
adequate performance. And you don't need any explicit
meta-meta-heuristics at all. It would be interesting to see if this
is really true for other tasks as well. I suspect that this is true
whenever the heuristics are fairly independent of each other.
One problem we all face is how to cope with changes in a system's
knowledge base. The standard solution is to track down all the
effects of each change, and update the whole data base. The solution
AM uses, and perhaps people too, is to just record the changed
information, and ⊗4not⊗* to update everything. When a contradiction
of some kind occurs, then the conflicting knowledge is checked, the
knowledge ⊗4it⊗* was based on is checked, and so on, until we
encounter some changed opinion which explains the disagreement. So
the system is allowed to hold contradictory viewpoints, so long as it
could resolve any dispute if it had to. This is one solution you
might keep in mind for your knowledge-based systems. I think that
⊗4this⊗* works because there are so few contradictions; that in turn
is due to the independence of the reasoning involved.
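The lazy scheme just described can be sketched as a small search back through support chains; the belief records and field names here are invented purely to illustrate the idea, in Python rather than the system's LISP.

```python
# Hypothetical sketch of lazy updating: record a changed opinion but
# don't propagate it. When a contradiction surfaces, walk back through
# each belief's support chain until some changed opinion is found that
# explains the disagreement.

beliefs = {
    "A": {"support": [], "changed": True},    # the opinion that was revised
    "B": {"support": ["A"], "changed": False},
    "C": {"support": ["B"], "changed": False},
}

def explain_conflict(name):
    """Return the first changed belief found along the support chains."""
    frontier = [name]
    while frontier:
        b = frontier.pop(0)
        if beliefs[b]["changed"]:
            return b                  # this change explains the disagreement
        frontier.extend(beliefs[b]["support"])
    return None
```

So the system may hold contradictory viewpoints indefinitely, as long as a call like this could resolve any dispute on demand.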